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METHOD AND APPARATUS FOR SPEECH CODING WITH 
VOICED/UNVOICED DETERMINATION 

Background of the Invention 

The present invention relates to speech processing, and more particularly to a 
voicing determination of the speech signal having a particular, but not 
exclusive, application to the field of mobile telephones. 

In known speech codecs the most common phonetic classification is a voicing 
decision, which classifies a speech frame as voiced or unvoiced. Generally 
speaking, voiced segments are typically associated with high local energy and 
exhibit a distinct periodicity corresponding to the fundamental frequency, or 
equivalently pitch, of the speech signal, whereas unvoiced segments 
resemble noise. However, speech signal also contains segments, which can 
be classified as a mixture of voiced and unvoiced speech where both 
components are present simultaneously. This category includes voiced 
fricatives and breathy and creaky voices. The appropriate classification of 
mixed segments as either voiced or unvoiced depends on the properties of 
the speech codec. 

In a typical known analysis-by-synthesis (A-b-S) based speech codec the 
periodicity of speech is modelled with a pitch predictor filter, also referred to 
as a long-term prediction (LTP) filter. It characterises the harmonic structure 
of the spectrum based on the similarity of adjacent pitch periods in a speech 
signal. The most common method used for pitch extraction is the 
autocorrelation analysis, which indicates the similarity between the present 
and delayed speech segments. In this approach the lag value corresponding 
to the major peak of the autocorrelation function is interpreted as the pitch 
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period. It is typical that for voiced speech segments with a clear pitch period 
the voicing determination is closely related to pitch extraction. 

According to a first aspect of the present invention there is provided a method 
5 for determining the voicing of a speech signal segment, comprising the steps 
of: dividing a speech signal segment into sub-segments, determining a value 
relating to the voicing of respective speech signal sub-segments, comparing 
said values with a predetermined threshold, and making a decision on the 
voicing of the speech segment based on the number of the values on one 
1 0 side of the threshold. 

According to a second aspect of the present invention there is provided a 
device for determining the voicing of a speech signal segment, comprising 
means (106) for dividing a speech signal segment into sub-segments, means 
15 (110) for determining a value relating to the voicing of respective speech 
signal sub-segments, means (112) for comparing said values with a 
predetermined threshold and means (112) for making a decision on the 
voicing of the speech segment based on the number of the values on one 
side of the threshold. 
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Summary of the Invention 

The invention provides a method for voicing determination to be used 
particularly, but not exclusively, in a narrow-band speech coding system. An 
aim of the invention is to address the problems of prior art by determining the 
25 voicing of the speech segment based on the periodicity of its sub-segments. 
The embodiments of the present invention give an improvement in the 
operation in a situation where the properties of the speech signal vary rapidly 
such that the single parameter set computed over a long window does not 
provide a reliable basis for voicing determination. 
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A preferred embodiment of the voicing determination of the present invention 
divides a segment of speech signal further into sub-segments. Typically the 
speech signal segment comprises one speech frame. Furthermore, it may 
optionally include a possible lookahead which is a certain portion of the 
speech signal from the next speech frame. A normalised autocorrelation is 
computed for each sub-segment. The normalised autocorrelation values of 
these sub-segments are forwarded to classification logic, which compares 
them to the predefined threshold value. In this embodiment, if a certain 
percentage of normalised autocorrelation values exceeds a threshold, the 
segment is classified as voiced. 

In one embodiment of the present invention, a normalised autocorrelation is 
computed for each sub-segment using a window whose length is proportional 
to the estimated pitch period. This ensures that a suitable number of pitch 
periods is included to the window. 

In addition to the above, a critical design problem in voicing determination 
algorithms is the correct classification of transient frames. This is especially 
true in transients from unvoiced to voiced speech as the energy of the speech 
signal is usually growing. If no separate algorithm is designed for classifying 
the transient frames, the voicing determination algorithm is always a 
compromise between the misclassification rate and the sensitivity to detecting 
transient frames appropriately. 

To improve the performance of the voicing determination algorithm during 
transient frames without increasing the misclassification rate practically at all, 
one embodiment of the present invention provides rules for classifying the 
speech frame as voiced. This is done by emphasising the voicing decisions of 
the last sub-segments in a frame to detect the transients from unvoiced to 
voiced speech. That is, in addition to having a certain number of sub- 
segments having a normalised autocorrelation value exceeding a threshold 
value, the frame is classified as voiced also if all of a predetermined number 



of the last sub-segments have a normalised autocorrelation value exceeding 
the same threshold value. Detection of unvoiced to voiced transients is thus 
further improved by emphasising the last sub-segments in the classification 
logic. 

The frame may be classified as voiced if only the last sub-segment has a 
normalised autocorrelation value exceeding the threshold value. 

Alternatively, the frame may be classified as voiced if a portion of the sub- 
segments out of the whole speech frame have a normalised autocorrelation 
value exceeding the threshold. The portion may, for example be substantially 
a half, or substantially a third of the sub-segments of the speech frame. 

The voiced/unvoiced decision can be used for two purposes. One option is to 
allocate bits within the speech codec differently for voiced and unvoiced 
frames. In general, voiced speech segments are perceptually more important 
than unvoiced segments and thus it is especially important that a speech 
frame is correctly classified as voiced. In the case of A-b-S type of codec, this 
can be done e.g. by re-allocating bits from the adaptive codebook (e.g. from 
LTP-gain and LTP-lag parameters) to the excitation signal when the speech 
frame is classified as unvoiced to improve the coding of the excitation signal. 
On the other hand the adaptive codebook in a speech codec can then be 
even switched off during the unvoiced speech frame which will lead to 
reduced total bit rate. Because of this on/off switching of LTP-parameters it is 
especially important that a speech frame is correctly classified as voiced. It 
has been noticed that, if a voiced speech frame is incorrectly classified as 
unvoiced and the LTP parameters are switched off, this leads to a decreased 
sound quality at the receiving end. Accordingly, the present invention provides 
a method and device for a voiced/unvoiced decision to make a reliable 
decision, especially, so that voiced speech frames are not incorrectly decided 
as unvoiced. 



Brief Description of the Drawings 

Exemplary embodiments of the invention are hereinafter described with the 
reference to the accompanying drawings, in which: 

Figure 1 shows a bloci< diagram of an apparatus of the present invention; 
Figure 2 shows a speech signal framing of the present invention; 
Figure 3 shows a flow diagram in accordance with the present invention; 
Figure 4 shows a block diagram of a radiotelephone utilising the invention. 

Detailed Description of the Invention 

Figure 1 shows a device 1 for voicing determination according to the first 
embodiment of the present invention. The device comprises a microphone 
101 for receiving an acoustical signal 102, typically a voice signal, generated 
by a user, and converting it into an analog electrical signal at line 1 03. An A/D 
converter 1 04 receives the analog electrical signal at line 1 03 and produces a 
digital electrical signal y(t) of the user's voice at line 105. A segmentation 
block 1 06 then divides speech signal to predefined sub-segments at line 1 07. 
A frame of 20 ms (160 samples) can for example divided into 4 sub-segments 
of 5 ms. After segmentation a pitch extraction block 108 extracts the optimum 
open-loop pitch period for each speech sub-segment. The optimum open-loop 
pitch is estimated by minimising the sum-squared error between the speech 
segment and its delayed and gain-scaled version as following: 

J{t,T,g{t)) - l.{y{t+i)- g{t)y{t + i-T)y (1) 

where y(t) is the first speech sample belonging to the window of length N, t Is 
the integer pitch period and g(t) is the gain. 

The optimum value of g(t) is found by setting the partial derivative of the cost 
function (1) with respect to the gain equal to zero. This yields 
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where 
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R{t,T) = J^y{t+i)y{t + i-T) 



(3) 



is the autocorrelation of y(t) with delay t and, 
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R(t) = R(t,0) = ^y^(t + i) 



(4) 



By substituting the optimum gain to equation (1), the pitch period is estimated 
by maximising the latter term of 



with respect to delay t. The pitch extraction block 108 is also arranged to 
send the above determined estimated open-loop pitch estimate t at line 113 
to the segmentation block 106 and to a value determination block 110. An 
example of the operation of the segmentation is shown in figure 2, which is 
described later. 

The value determination block 110 also receives the speech signal y(t) from 
the segmentation block 106 at line 107. The value determination block 110 is 
arranged to operate as following: 



J(t,T) =R(t)- 
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To eliminate tine effects of tine negative values of the autocorrelation function 
when maximising the function, a square root of the latter term of equation (5) 
is taken. The term to be maximised is thus: 



C^(t,T) = R(t,T)/^Rit-T) 



(6) 



During voiced segments the gain g(t) tends to be near unity and thus It is 
often used for voicing determination. However, during unvoiced and transient 
regions the gain g(t) fluctuates achieving also values near unity. A more 
robust voicing determination is achieved by observing the values of equation 
(6). To cope with the power variations of the signal, R{t,T) is normalised to 
have a maximum value of unity resulting: 



According to one aspect of the invention the window length in (7) is set to the 
found pitch period x plus some offset M to overcome the problems related to a 
fixed-length window. The periodicity measure used is thus 





(8) 



where 
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and 
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(10) 



; = 0 
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The parameter M can be set, e.g. to 10 samples. A voicing decision block 112 
is to receive the above determined periodicity measure C2(t, t) at line 1 1 1 
from the value determination block 1 10 and parameters K, Ktr, Ctr to make the 
5 voicing decision. The decision logic of voiced/unvoiced decision Is further 
described in figure 3 below. 

It should be emphasised that the pitch period used in (8) can also be 
estimated in other ways than described in equations (1) - (6) above. A 
10 common modification is to use pitch tracking in order to avoid pitch multiples 
described in a Finnish patent application Fl 971976. Another optional function 
for the open-loop pitch extraction is that the effect of the formant frequencies 
is removed from the speech signal before pitch extraction. This can be done 
for example by a weighting filter. 
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Modified signals e.g. residual signal, weighted residual signal or weighted 
speech signal, can also be used for voicing determination instead of the 
original speech signal. Residual signal is obtained by filtering the original 
speech signal by linear prediction analysis filter. 
20 It may also be advantageous to estimate the pitch period from the residual 
signal of the linear prediction filter instead of the speech signal, because the 
residual signal is often more clearly periodic. 

Residual can be further low-pass filtered and down-sampled before the above 
25 procedure. Down-sampling reduces the complexity of correlation computation. 
In one further example the speech signal is first filtered by a weighting filter 
before the calculation of autocorrelation is applied as described above. 

Figure 2 shows an example of dividing a speech frame into four sub- 
30 segments whose starting positions are t1 , t2, t3 and t4. The window lengths 
N1, N2, N3 and N4 are proportional to the pitch period found as described 



above. The lookahead is also utilised in the segmentation. In this example, 
the number of sub-segments is fixed. Altematively the number of sub- 
segments can variable based on the pitch period. This can be done for 
example by selecting the subsegments by t2= t1 + x + L, t3= t2 + t + L, etc. 
until all available data is utilised. In this example L is constant and can be set 
e.g. -10 resulting overlapping sub-segments. 

Figure 3 shows a flow diagram of the method according to one embodiment of 
the present invention. The procedure is started by step 301 where the open- 
loop pitch period t is extracted as exemplified above In equations (1) - (6). At 
step 302 C2(t, t) is calculated for each sub-segment of the speech as 
described in equation (8). Next at step 303 the number of sub-segments n is 
calculated where C2(t, t) is above a certain first threshold value Qr. The 
comparator 304 determines whether the number of sub-segments n, 
determined at step 303, exceeds a certain second threshold value K. If the 
second threshold value K Is exceeded the speech frame is classified as 
voiced. OthenA/ise the procedure continues to step 305. In this embodiment, at 
step 305 the comparator determines if a certain number Ktr of last sub- 
segments have a value CaCt, t) exceeding the threshold Ctr. If the threshold is 
exceeded the speech frame is classified as a voiced frame. Otherwise the 
speech frame is classified as unvoiced frame. 

The exact parameter values Ctr, Ktr and K presented above are not limited to 
certain values but are dependent on the system specified and can be selected 
empirically using a large speech database. For example, if the speech 
segment is divided into 9 sub-segments suitable values can be e.g. Qr = 0.6, 
Ktr = 4 and K= 6. An appropriate value of K and Ktr is proportional to the 
number of sub-segments. 

Alternatively, according to present invention, the frame is classified as voiced 
if only the last sub-segment (i.e. Ktr = 1) has a normalised autocorrelation 
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value exceeding the threshold value. According to still one modification the 
frame is classified as voiced if substantially half of the sub-segments out of 
the whole speech frame (e.g. 4 or 5 sub-segments out of 9) have a 
normalised autocorrelation value exceeding the threshold. 

Figure 4 is a block figure of a radiotelephone describing the relevant parts for 
the present invention. The radiotelephone comprises of a microphone 61, 
keypad 62, display 63, speaker 64 and antenna 71 with switch for duplex 
operation. Further included is a control unit 65, implemented for example in 
an ASIC circuit, for controlling the operation of the radiotelephone. Figure 3 
also shows the transmission and reception blocks 67, 68 including speech 
encoder and decoder blocks 69, 70. The device for voicing determination 1 is 
preferably included within the speech encoder 69. Alternatively the voicing 
determination can be Implemented separately, not within the speech encoder 
69. The speech encoder/decoder blocks 69, 70 and the voicing determination 
1 can be implemented by a DSP circuit Including the elements known as 
such, e.g. internal/external memories and registers, for Implementing the 
present invention. The speech encoder/decoder can be based on any 
standard/technology and the present invention thus forms one part for the 
operation of such codec. The radiotelephone Itself can operate in any existing 
or future telecommunication standard based on digital technology. 
In the view of foregoing description it will be evident to a person skilled in the 
art that various modifications may be made within the scope of the present 
invention. 

What is claimed is 
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1. A method for determining the voicing of a speech signal segment, 
comprising the steps of: dividing a speech signal segment into sub-segments, 
determining a value relating to the voicing of respective speech signal sub- 
segments, comparing said values with a predetermined threshold, and making 
a decision on the voicing of the speech segment based on the number of the 
values on one side of the threshold. 

2. A method of claim 1 , wherein said step of making a decision is based on 
whether the value relating to the voicing of the last sub-segment is on the one 
side of the threshold. 

3. A method of claim 1 , wherein said step of making a decision is based on 
whether the values relating to the voicing of last Ktr sub-segments are on the 
one side of the threshold. 

4. A method of any preceeding claim, wherein said step of making a decision 
is based on whether the values relating to the voicing of substantially half of 
the sub-segments of the speech signal segment are on the one side of the 
threshold. 

5. A method of any preceeding claim, wherein said value related to voicing of 
respective speech signal sub-segments comprises an autocorrelation value. 

6. A method of claim 5, wherein said autocorrelation value is determined 
based on the estimated pitch period. 

7. A method of any preceeding claim, wherein the determining the voicing of a 
speech signal segment comprises a voiced/unvoiced decision. 

8. A device for determining the voicing of a speech signal segment, 
comprising means (106) for dividing a speech signal segment into sub- 
segments, means (110) for determining a value relating to the voicing of 
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respective speech signal sub-segments, means (112) for comparing said 
values with a predetermined threshold and means (112) for making a decision 
on the voicing of the speech segment based on the number of the values 
falling on one side of the threshold. 
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9. A device of claim 8, wherein said means for making decision comprises 
means for determining if the value of the last sub-segment is on the one side 
of the threshold. 

10 10. A device of claim 8, wherein said means for making decision comprises 
means for determining if the values of last Ktr sub-segments are on the one 
side of the threshold. 

11. A device of any of claims 8 to 10, wherein said means for making a 
15 decision comprises means for determining whether the values relating to the 

voicing of substantially half of the sub-segments the speech signal segment 
are on the one side of the threshold. 

12. A device of claim 8, wherein the said means for determining a value 
20 relating to the voicing of respective speech signal sub-segments comprises 

means for determining the autocorrelation value. 



